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Abstract — We propose a new distributed algorithm for com- 
puting a truncated Newton method, where the main diagonal 
of the Hessian is computed using belief propagation. As a 
case study for this approach, we examine the sensor selection 
problem, a Boolean convex optimization problem. We form two 
distributed algorithms. The first algorithm is a distributed version 
of the interior point method by Joshi and Boyd, and the second 
algorithm is an order of magnitude faster approximation. As an 
example application we discuss distributed anomaly detection in 
networks. We demonstrate the applicability of our solution using 
both synthetic data and real traffic logs collected from the Abilene 
Internet backbone. 

I. Introduction 

Interior point methods are efficient algorithms for solving 
convex optimization problem. Typically they converge faster 
than gradient based methods, since they exploit information 
from the second derivative (the Hessian) about changes in 
the gradient to speed up convergence. However, in practice 
it they are not frequently deployed to large system because 
they involve inverting the Hessian which has a cubic cost in 
the number of variables. Truncated Newton methods are an 
approximation where only partial information from the Hessian 
is used. For a good overview of interior point methods see [1]. 

In this work, we propose an efficient way for approximating 
the main diagonal of the Hessian inverse using belief propaga- 
tion. This provides us with a fast approximation to the interior 
point method, without fully inverting the Hessian matrix. As a 
case study for our approach, we analyze the sensor selection 
problem, which is a boolean convex optimization problem; 
given m sensor measurements we aim at finding k < m 
measurements that minimize the log volume of the resulting 
confidence ellipsoid, a scalar quantity of the uncertainty in the 
data. This problem is known to be NP-hard [2]. Recent work 
by Joshi and Boyd [3] proposes a solution of the relaxed sensor 
selection problem using an efficient interior point method. 
This work further gives a good overview of previous work 
and related algorithms in different domains. Other work by 
Karuse et aJ. [4] discuss the related sensor placement problem, 
and tackle the problem differently by minimizing the mutual 
information between the sensor placements selected. 

Most of the existing algorithms for solving the sensor selec- 
tion problem are not distributed. One of the few exception is 
the work of [5], which proposes a heuristic for sensor selection 



in sensor network context. We believe it is important to address 
the problem of distributed sensor selection. The amount of 
data collected from sensors is rapidly growing and centralized 
algorithms will not be able to process this vast amount of data. 
Recent survey paper by Hellerstein [6], discusses the related 
quantitative database cleaning problem in large scale databases. 
As we show in the current paper, the sensor selection problem 
is closely related to the minimum volume enclosing ellipsoid 
problem, a technique that is commonly used to minimize the 
uncertainty of stored data in large databases. Additional reason 
to have a distributed sensor selection algorithm is when the data 
is generated by a distributed set of sensors and it is not desirable 
to ship the collected data to one central processing node. In 
this case a distributed algorithm for selecting the "best" set 
of measurements is preferable. As a specific example of such 
a scenario, we discuss the problem of distributed anomaly 
detection in communication networks. 



In the current work we extend the previous construction of 
[3] by forming two distributed algorithms. The first algorithm 
is a distributed version of the centralized interior point method 
by Joshi and Boyd. The second algorithm is a fast and 
light-weight distributed approximation, which is a truncated 
Newton method. An advantage of the interior point method 
over the heuristic algorithms is that performance bounds can 
be computed along the run. We demonstrate the applicability 
of our solution using both synthetic data and real traffic logs 
collected from the Abilene Internet backbone. 



The structure of this paper is as follows. In Section I we 
introduce the sensor selection problem and its linear relaxation. 
In Section II we describe the Gaussian belief propagation 
algorithm, an efficient distributed iterative algorithm, which is 
the basis to our construction. In Section III we present our 
novel distributed algorithm, which is a distributed version of 
the Newton method proposed in [3]. Section IV presents our 
light-weight approximation for the truncated Newton method. 
Section V brings experimental results that compare the accu- 
racy of both our algorithms and the original algorithm proposed 
in [3]. We conclude in Section VI. 



A. Problem settings 

Suppose we are to estimate a vector from linear measure- 
ments, corrupted by additive noise 



Ax + w 



(1) 



where x G M. n is a vector of parameters to estimate, and 
w £ M m is an independent identically distributed AWGN 
noise J\f(0, a 2 I). We assume that A mxn has full column rank 
(m > n). The maximum likelihood estimator of x is 



x 



(A T A)~ 1 A T y 



The 77-confidence ellipsoid for x — x is the minimum volume 
ellipsoid that contains x x with probability r\. The r\- 
confidence ellipsoid is given by 

E a = {z\z T a 2 (A T A)- 1 z < a} , 

where a = F~n is the cumulative distribution function of a y 2 
random variable with n degrees of freedom. A scalar measure 
of the quality of estimation is the volume of the //-confidence 
ellipsoid: 

vol(£ Q ) = £det(<j- 2 A T A) , 

where £ is the volume of the unit ball in R™ [7]. The log volume 
of the confidence ellipsoid gives a quantitative measure of how 
informative the collection of m measurements is: 

logvol^o.) =P-\ logdct(A T yl) , 

where (3 is a constant that depends only on a, n, and 77. 

Given m measurements, we would like to choose a subset 
n < k < m of them that minimizes the log volume of 
the resulting confidence ellipsoid. We define the following 
optimization problem: 



max log det(A T diag(z)A) , 

zG{0,l}" 



such that 



5 T 1 = k. 



k < 



where 1 is the appropriately sized all ones vector. This problem 
is known to be a boolean convex problem [3]. 



Given 



feasible starting point xq and tolerance e > 0, k = 1 



Repeat 1 Compute the Newton step and decrement 



AZ : 



2 Stopping criterion: quit if A /2 < e; 

3 Line search: Choose step size t by backtracking line search; 

4 Update: := x^_! + tAz, k = k + 1. 



TABLE I 

The constrained Newton algorithm [1, §10.2.2] 



B. Relaxed problem 

Recent work by Joshi and Boyd [3] proposes to relax (O to 
allow a fractional < < 1 to get a relaxed version of the 
problem: 

max logdGt(^ T diag(z)A) , (3) 



ze[o,i] 
such that 



z T l 



k < m . 



A common technique for solving the relaxed problem is the 
log barrier method. In the log barrier method the constrains 
< zt < 1 are incorporated into the cost function 

m 

max log detM T diag(z)A) + k > (logfzi) + logfl — zA) , 

-610.1]- ^ 

(4) 

such that z T l = k , k < m , 

where k > is a weighting parameter that controls the 
accuracy of the approximation. The approximated function 
is concave and smooth and thus it is possible to use the 
Newton method [1, §10.2.2] efficiently. Table H-Al outlines the 
constrained Newton method. Starting from an initial feasible 
point Xo, at each Newton step we compute the search direction 
Az. Then a backtracking line search [1, §9.2] is used to 
compute a step size t. The current solution x^ is replaced by 
Xfc + tAz. We stop when the Newton decrement is small. 

For computing Az we need to compute the Hessian H and 
the gradient g. The gradient is given by 

g = - diag( J 4Jf/i T ) + K h, (5) 

where 

X = (A T diag(z)A)" 1 , hi = l/zi+l/(l-Zi), Vi = l...m. 
The Hessian is given by 

H = -(AXA T ) o (AXA T ) - diag(Kp) , (6) 
where o is the Hadamard (elementwise) product, 

Pl = l/z 2 + 1/(1- z t ) 2 , Vz =l...m. 



(2) Finally the search direction Az is computed by 



Az = -H^g+i 



l T if _1 g 



H~ l l. 



(7) 



II. Gaussian belief propagation 

We wish to compute the maximum a posteriori (MAP) 
estimate of a random vector x with Gaussian distribution (after 
conditioning on measurements): 



p(x) oc cxp{ — ix T Jx + h T x} , 



(8) 



where J >- is a symmetric positive definite matrix (the 
information matrix) and h is the potential vector. This problem 
is equivalent to solving Jx = h for x given (h, J) or to solve 
the convex quadratic optimization problem: 



minimize f(x) = ^x T Jx — h T x . 



(9) 



GaBP is an efficient distributed message-passing algorithm 
for inference over a Gaussian graphical model. Given the 



# 


Stage 


Operation 


1. 


Initialize 


Set ciij = and j3y = 0, V(i, j) £ 5 


2. 


Iterate 


For all (j, j) 6 g 

a Ai = J™ + EfcgN(i)\j Q fei 

= hi + 53 fcsN (i)\j /9fci 
Qi j = — J^.aT 1 

end 


3. 


Check 


If a's and /?'s have converged, 
continue to #4. Else, return to #2. 


4. 


Infer 


#i = [Jii + EfcgN(i) 

Ai = Ki(hi + Efcgn(i) /3fei). 


5. 


Output 


x* = Ai> Vi. 



TABLE II 

Computing x* = argmax x exp(— |x T Jx + h T x) via GaBP. 



Gaussian density function ([8]) or objective function (O, we are 
interested in calculating the marginal densities, which must also 
be Gaussian, 

p(xi) ~ 7V(pt 4 = (J-^Ki 4 (J _1 )ji) , 

where and A'i are the marginal mean and variance, respec- 
tively. The GaBP update rules are summarized in Table HIT We 
write N(i) to denote the set of neighbors of node i (non zero 
entries in J at row i. 

It is known that if GaBP converges, it results in the exact 
MAP estimate x*, although the variance estimates Ki com- 
puted by GaBP are only approximations to the correct variances 
[8]. 

III. First algorithm: distributed interior point 

METHOD 

Following our previous work [9], [10], we propose a way to 
distribute the Newton method for solving the sensor selection 
problem. This construction serves mainly for comparing the 
computational overhead and the accuracy with our approx- 
imation algorithm, presented in the next section. There are 
several challenges that make the sensor selection problem 
harder to distribute than our previous work [10] that addressed 
a distributed version of standard linear programming. First, the 
computation of the gradient, (|5j, involves inverting a matrix Q 
and computing the full inverse, which is needed for computing 
the Hessian. Furthermore, the search direction Az computation, 
©, requires solution of two systems of linear equations. 
Typically in linear programming the computation of the search 
directions involves a solution of single linear system of equa- 
tions and not the full inversion of the Hessian matrix. Second, 
the linear iterative algorithms, which are used for efficiently 
solving the linear system of equations empirically, failed to 
converge. To this end, we deploy our recent construction [11] 
for forcing convergence of the iterative algorithms to the correct 
solution. 

Note that the distributed algorithm is not an approximation, 
it solves the relaxed sensor selection problem (01 accurately. 
However, there is higher computational cost relative to the 
approximation algorithm presented in next section. 



A. Distributed computation of the gradient 

At each Newton step, we would like to compute the follow- 
ing gradient 

g = - diag(A(A T diag(z)A) _1 A T ) + «h . 

We propose to utilize the GaBP algorithm for computing the 
matrix inverse (A T diag(z)A) -1 using the following construc- 
tion. Denoting Q = (A T diag(z)A), we solve n instances of 
linear systems of equations, where the i-th system, Qri = 
ei, r, <E R n , is the solution, and is a vector with 1 at 
the i-th position and zero elsewhere. This is equivalent to 
computing since for each i we compute = Q _1 e;, 

which is the i-th row of the required solution. Note that the 
computation can be done in parallel, since the i-th equation 
does not depended on the solution of the other equations. To 
distribute this computation, each computing node gets one row 
of the matrix Q and the matching entry of the vector ej. It is 
known that when the GaBP algorithm converges it converges 
to the correct solution, so Q" 1 is not an approximation. After 
computing Q _1 the gradient is computed by multiplying by 
A(Q~ 1 )A T , a computation that is fairly easy to distribute. 
Finally, the diagonal entry is selected, di&g(A(Q~ 1 )A T ), and 
each computing node adds the scalar nhi to its matching 
gradient entry. The result of this computation is that gi is stored 
in the i-th computing node. 

B. Computing the Hessian and search direction 

After computing the gradient, each computing node has the 
matching row of AXA T . The Hessian is computed by mul- 
tiplying the matrices AXA T using the elementwise product: 
H = —(AXA T ) o (AXA T ) - diag(«p). The result of this 
operation is that each computing node has the matching row 
in the Hessian. 

For computing the search direction Az, ©, two linear 
systems of equations are solved: H~ 1 g and ff — 1 1. Again, we 
use the GaBP algorithm for computing these solutions. These 
two computation can be done in parallel, as well. 

C. Convergence and overhead 

We use the recent construction of Johnson et aJ. [11] for 
forcing convergence of a linear iterative algorithm to the 
correct solution. The construction is applied in three places: 
in computing the inverse matrix Q _1 and in solving -ff _1 g and 
H- 1 !. 

Next we discuss the overhead of the distributed Newton 
method. Typically there are around 10-20 Newton iterations 
until convergence. In each Newton step, in the gradient com- 
putation the inversion of the matrix Q requires solving n 
linear systems of equations (which can be solved in parallel). 
In theory, a logarithmic number of iterations is required for 
convergence. An upper bound on converge speed is presented 
in [10]. In total, the cost of inverting Q is 0(n 3 log(n)). Note 
that this cost is higher than the traditional Gaussian elimination 
0(n 3 ), however we require only 0(log(n)) communication 
rounds, relative to n communication rounds that are needed 
for a distributed implementation of Gaussian elimination. The 
computation of the search direction involves a solution of 



two linear systems of equations with dominating overhead of 
0(m 2 log(m)). 

The above analysis is for the worst case. In case the matrix 
Q is sparse, further speedup can be obtained. In practice, fast 
convergence (tens of iterations) of the GaBP algorithm was 
observed in problem sizes of millions of variables [12]. Note, as 
mentioned above, the distributed Newton method is presented 
here mainly for reference. 

IV. Second algorithm: fast approximation 

ALGORITHM 

Our goal is to allow a distributed and efficient computation 
of for solving the approximate problem. We propose an 
approximate computation (truncated Newton method) that is 
composed of two steps: approximation of the gradient computa- 
tion (0, and an approximation of the Hessian ((5). In SectionlVll 
we show that our approximation is comparable in performance 
to the original centralized algorithm. 

A. Approximation of the gradient 

Theorem 1: The gradient, (0, can be computed by inverting 
the following matrix: 

A T 

A Z 



E = 



(10) 



and taking the diagonal of the lower right block, where Z = 
— diag(z) -1 . 

Proof: Using the Schur complement of the matrix M 



A B 
C D 



the lower right block of M 1 is 



Y = D' 1 + D~ 1 C(A - BD~ 1 C)BD~ 1 . 

Substituting A = 0, B = A T , C = A, D = -Z- 1 we get 

Y = -Z + ZA(0 + A T ZAy 1 A T Z. (11) 

Multiply by Z~ 2 to get Z^Y = Z~ x + A(A T ZA)~ 1 A T . 
Equivalently, A{A T ZA)^ A T = Z- 2 (Y + Z). Finally the 
gradient is given by 



g= -diag(Z- 2 (Y + Z)) + nh. 



(12) 



Theorem [T] shows an alternative way of accurately computing 
the gradient in each Newton step. The following theorem 
presents how to compute dT3T > approximately. 

Theorem 2: The gradient, dol l, can be computed approxi- 
mately and distributively using the GaBP algorithm. 

Proof: Given the matrix E defined in (TTOb we construct 
the following multivariate probability distribution 

p(x) = cxp(-ix T _Ex + d T x) . 

It is shown in [8] that the GaBP algorithm, when converging, 
computes the exact MAP assignment of x, which is the 
vector E^d, as well as an approximation to the variance 
estimates K{ « diag(-E _1 ). We are interested only in the m 
last entries of K n+ i . . . K n+m+ i, which is an approximation 
to the main diagonal of Y as defined in dTTb . We denote 
this approximation (the output of the GaBP algorithm) as 



Y = (K n+ \ . . . K n+m+ i) G M. m . Now we can compute the 
approximated gradient: 

g = -diag(Z- 2 (f + Z)) + K h. (13) 



B. Approximating the Hessian and computing the search di- 
rection 

We propose to use a truncated Newton method, where only 
the main diagonal of the Hessian is computed. 

H = - diag(g o g) - diag(Kp) . 

Next, we need to compute the search direction (step 1 in Table 
iTAt : 

Since H is diagonal, this computation can be done in 0{m). 
Note that our construction for approximating the Hessian is 
general and can be used in other problems as well. 

C. Cost analysis 



Algorithm 


Sensor Selection [3] 


Fast Approximation 


Gradient computation 


0(n<) 


0(mn log(m + n)) 


Hessian computation 


0(m z ) 


O(m) 


Search direction 


0(m z ) 


O(m) 


Total cost 


0(n z +m i ) 


0(mnlog(m)) 



TABLE III 
Algorithm cost comparison 

Computing the Newton step in [3] is dominated by invert- 
ing A T diag(z)A in the gradient computation, which costs 
0(n 3 ). Computing the Hessian involves computing the product 
(AXA T ) o (AXA T ), which costs 0(m 2 ). Computing the 
search direction Az costs another 0(m 2 ). In total we get 
0(n 2 + m 3 ). In contrast, in our method we compute an 
approximation of the gradient by running the GaBP algorithm. 
Assuming A is dense, we get 0{mn) non-zero elements 
in the matrix E. When the GaBP algorithm converges, the 
number of iterations is typically logarithmic in the matrix size 
(convergence rate analysis can be found on [10]). The Hessian 
is computed by its diagonal approximation, which costs only 
0(m). The search direction computation takes another 0(m). 
In total the heaviest operation is the gradient computation, 
which dominates the total complexity. It is important to note 
that our method is an approximation to the method proposed 
in [3], where its efficiency is discussed. 

V. Relation to the minimum volume enclosing 

ELLIPSOID PROBLEM 

The minimum volume enclosing ellipsoid problem (MVEE) 
is formulated as follows. Given m vectors of dimension n, we 
aim at finding the smallest ellipsoid that encloses all vectors: 



logdetM, 

such that &J M &i < 1 Vi = 1, . . . , m , 



mm 
AfeS" 



(14) 



where S™ is the set of symmetric n x n matrices. The following 
theorem establishes the connection between the dual of the 
MVEE and the sensor selection problem. This means that if we 
have an efficient algorithm for computing the sensor selection 
problem we can also solve the MVEE problem efficiently. The 
relation to the dual was also observed in [3], but here we 
provide an alternative and simpler proof. 

Theorem 3: The dual of the MVEE is related to the relaxed 
sensor selection problem, (O, where the number of selected 
sensors is k = n. 

Proof: We start by forming the Lagrangian 

n 

L(M, A) = — log det M + ^ z,(l - af Ma;) , (15) 

t=i 

where z > is a vector of Lagrange multipliers. Now we 
compute the optimality conditions by computing the derivative 
and comparing it to zero. 

^^ = -A/- + ^diag(z)A = 0, 

M _1 = A T diag(z)A , (16) 

oz ^-^ 

1=1 
n 

i=i 

The dual of the MVEE problem is obtained by substituting 
(fT6b and dT7j into (Q3]l to get 

max log det A T diag(z)A , 

z 

such that zl = n , z > . 

Notice the relation to (O, where the number of sensors is k 
n. The only difference is that in the sensor selection proble 
z £ [0, 1]™, whereas here z > 0. 

VI. Experimental results 

We evaluate our proposed algorithms using two datasets. Tl 
first one is a synthetic dataset, which is borrowed from [3], ai 
the second one is real data acquired from the Abilene Internets 
backbone network [13]. Our simulation is heavily based on the 
Matlab simulation of [3], and is available on the web [14]. In 
all experiments we have used a Newton tolerance of 1CP 3 and 
the GaBP threshold was set to 1CT 8 . 

A. Synthetic data 

Following [3] we use m = 100 potential sensors and n = 20 
parameters to estimate. The m measurement vectors S R™ 
are chosen randomly, and independently, from the distribution 
7V(0, We solve the relaxed problem, (01, with distributed 

Newton construction as well as our distributed approximate 
algorithm (denoted as GaBP in the graphs). Figure 1 shows 
the duality gap of the different methods tested. The j/-axis 
presents the duality gap of the cost function, the a>axis ticks are 
the number of sensors selected. As predicated, the distributed 
Newton method has performs better than GaBP, because GaBP 
computes an approximation to the relaxed problem. However, 
when deploying the local search heuristic proposed in [3] 



we get a very good solution, which outperforms the Newton 
method. 

Figure 1 (top) plots some bounds on the solution quality, 
where upper bound is computed using the fact that the relaxed 
sensor selection problem solution is at most 2mr from the 
optimal solution [1, §10.4], the lower bound is computed using 
the simple selection rule. Improved lower bounds are obtained 
via the local optimization procedure. A simple method to 
carry this local optimization is to start from the k selected 
sensors, and check sensor selections that can be derived from 
by swapping one of the chosen sensors with one of the to — k 
sensors not chosen. 

Typically the number of Newton iterations in both algorithms 
are 4-8. As expected, the approximated algorithm converged 
very rapidly, taking 9-10 iterations for each Newton step. 
In contrary, the full Newton method required 20-50 GaBP 
iterations for each row of Q^ 1 (those iteration can be done in 
parallel by increasing message sizes in the network). The con- 
vergence enforcement construction used in the Newton method 
required up to 25 iterations of the outer loop. To conclude, 
the approximated algorithm requires tens of iterations, while 
the accurate distributed Newton method requires hundreds of 
iterations. 




Figure 1. Top: Synthetic data, bottom: Abilene data. 

B. Abilene network data 

We have downloaded data with daily activity statistics of 
Abilene Internet2 backbone available from [13]. The data 
reported was collected between June 2006 to December 2006, 
total of to = 153 days. In each day the average daily utilization 
of every Abilene network inbound and outbound links is given. 
Since there is a high correlation between inbound and outbound 
links, we have used only the inbound link activity. After 
filtering out links that where not active during the full examined 
period, we remained with n = 89 links. Our goal is to select the 
k days that characterize normal network operation, for making 
a distinction with the other m — k days that may indicate 
abnormal network behavior. Figure 1 (bottom) outlines the 
performance of the GaBP algorithm vs. the distributed Newton 
method. The x-axis presents k, the number of days selected. 
The y-axis represents the duality gap and the upper bounds 



respectively. As expected, the distributed Newton method has 
higher accuracy, but heaver computational overhead. As a 
sanity check, we have compared the number of error tickets 
opened each day and found a high correlation between them 
and the output of the distributed approximate algorithm. It is 
interesting to note that relative to the synthetic data set, the 
sensor selection problem becomes more difficult in the sense 
that the approximation algorithm is less accurate. This can 
be explained that the underlying distribution used by [3] is 
multivariate Gaussian with diagonal inverse covariance matrix, 
while real Internet traffic typically does not come from a 
Gaussian distribution. 

VII. Conclusion 

We have presented two distributed algorithms for solving 
the sensor selection problem. The first one is an accurate 
distributed version of the interior point method proposed in 
[3]. The second is a fast and light-weight approximation 
algorithm. Using simulations we have analyzed and compared 
to performance of both algorithms, using synthetic and real 
data collection from the Abilene Inetrnet2 backbone network. 
We believe there are several important applications to our novel 
construction for example in the area of quantitative database 
cleaning or distributed anomaly detection in communication 
networks. 
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